[57] ABSTRACT

Each time a matching unit 22 finishes a matching at a time 
"i" between feature vectors of input speech which have been 
converted by an analyzer 1 and feature vectors of reference 
patterns stored in a reference pattern memory 2, and goes to 
a matching at a next time "i+1", a feature vector integrator 
24 multiplies the feature vectors of the input speech or the 
reference patterns by a weight wl in each of acoustic 
categories, successively integrates products, stores inte- 
grated values in respective frames of the reference patterns 
in feature vector accumulating buffers 26, and integrates and 
stores weights wl in feature vector weight counters 27 
corresponding to the respective acoustic categories. When 
the feature vector integrator 24 finishes integrating and 
storing processes corresponding to the final matching 
process, a mean value calculator 28 divides the values stored 
in the final frames in the feature vector accumulating buffers 
26 by the values stored in the feature vector weight counters 
27, and outputs the quotients as mean values in the respec- 
tive acoustic categories of the input speech or the reference 
patterns. An adaptation unit 55 adapts one or both of the 
input speech or the reference patterns using the mean values. 

7 Claims, 4 Drawing Sheets 
ACOUSTIC CATEGORY MEAN VALUE
CALCULATING APPARATUS AND 
ADAPTATION APPARATUS 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates generally to speech 
recognition, and more particularly to an improvement in the
technique of calculating the mean value of each acoustic 
category that is necessary to effect speaker adaptation of 
input speech or reference patterns. 

2. Description of the Related Art 

Various different speech recognition techniques have been 
known depending on the nature and level of technology. The 
basic principles behind the existing speech recognition tech- 
niques are as follows: Utterances to be recognized are 
analyzed in a training or registering mode, and stored as 
reference patterns. An unknown utterance that is uttered by 
a speaker is analyzed in a recognition or testing mode, and 
the pattern produced as a result of the analysis is compared 
successively with the reference patterns. Then, a result that 
corresponds to one of the reference patterns which best 
matches the pattern is outputted as the recognized utterance. 

Among various speech recognition systems, a speaker- 
independent speech recognition system is widely used, in 
which utterances of many speakers are registered as refer- 
ence patterns to accommodate the distribution of the speaker 
individualities. Therefore, the speaker-independent speech 
recognition system is capable of recognizing utterances of 
an unknown speaker at a relatively high speech recognition 
rate regardless of speech sound variations in different speak- 
ers. 

However, the speaker-independent speech recognition 
system is disadvantageous in that it cannot achieve a high 
performance if unknown utterances that are inputted are 
largely different from those registered as reference patterns. 
It is also known that the speech recognition rate of the 
system is degraded if a microphone used to record testing 
utterance is different from the microphones that were used to 
record utterances to provide reference patterns. 

A technique which is known as "speaker adaptation" to 
improve the speech recognition rate has been proposed. The 
speaker adaptation process employs relatively few utter- 
ances provided by a specific speaker or a specific micro- 
phone to adapt reference patterns to the utterances. One 
example of the speaker adaptation method is disclosed by K.
Shinoda et al., "Speaker Adaptation on Using Spectral Inter-
polation for Speech Recognition", Trans. of IEICE (Jap.),
vol. J 77-A, No. pp. 120-127, February 1994 (hereinafter
referred to as "literature 1").

A conventional speech recognition system used for 
speaker adaptation will be described below with reference to
FIG. 1 of the accompanying drawings. 

As shown in FIG. 1, the conventional speech recognition 
system comprises an analyzer 1 for converting input speech 
into a time sequence of feature vectors, a reference pattern 
memory 2 for storing reference patterns, i.e., a time 
sequence of feature vectors that have been converted from 
training utterances and contain weighting information for 
each acoustic category, a matching unit 12 for comparing the 
time sequence of feature vectors of input utterances and the 
reference patterns to determine an optimum path and a 
time-alignment between the input utterances and the
reference patterns, a backtracking information memory 14 for
storing two-dimensional information associated by the
matching unit 12, a template information memory 16 for
storing template information, i.e., the index information of a
template that indicates which template has been used at
respective grid points if the template is a multiple template
having a plurality of reference patterns, and a mean vector
calculator 18 for carrying out a backtracking process to
determine which reference pattern is associated with the
input speech at each time, based on the two-dimensional
associated information stored in the backtracking
information memory 14. Both the backtracking information
memory 14 and the template information memory 16 have a
two-dimensional storage area having a size of (length of
input speech)×(length of reference pattern).

The analyzer 1 may convert input speech into a time
sequence of feature vectors according to any of various
spectral analysis processes. These various spectral analysis
processes include a method of employing output signals
from a band-pass filter bank in 10 through 30 channels, a
nonparametric spectral analysis method, a linear prediction
coding (LPC) method, and a method of obtaining various
multidimensional vectors representing short-time spectrums
of input speech with various parameters including a
spectrum directly calculated from a waveform by Fast Fourier
Transform (FFT), a cepstrum which is an inverse Fourier
transform of the logarithm of a short-time amplitude
spectrum of a waveform, an autocorrelation function, and a
spectral envelope produced by LPC.
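
Of the parameters listed above, the cepstrum is defined precisely enough to sketch directly. The following is a minimal illustration only, assuming a single real-valued frame and a naive discrete Fourier transform from the Python standard library; the function names and the small `floor` constant are hypothetical and are not taken from the specification.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    return [v / n for v in out] if inverse else out


def real_cepstrum(frame, floor=1e-12):
    """Inverse Fourier transform of the logarithm of the amplitude spectrum.

    `floor` is an assumed guard against log(0) and is not part of the definition.
    """
    log_amplitude = [math.log(abs(v) + floor) for v in dft(frame)]
    return [v.real for v in dft(log_amplitude, inverse=True)]
```

In practice an FFT would replace the naive transform; the quadratic version is used here only to keep the sketch free of external dependencies.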

Generally, feature vectors that are extracted as representing
speech features from input speech using discrete times as
a frame include power information, a change in power
information, a cepstrum, and a linear regression coefficient
of a cepstrum. Spectrums themselves and logarithmic
spectrums are also used as feature vectors.

Speech of a standard speaker is analyzed and converted
into a time sequence of feature vectors in the same manner
as the analysis process employed by the analyzer 1, and the
feature vectors are registered as reference patterns in units of
isolated words, connected words, or phonemes in the
reference pattern memory 2. Weighting information for
respective categories to be classified is established in advance
with respect to these reference patterns.

The matching unit 12 carries out a matching of dynamic
time warping between the time sequence of the feature
vectors of the input speech converted by the analyzer 1 and
the reference patterns stored in the reference pattern memory
2. The matching algorithm between the two patterns is
preferably one of the algorithms which take into account
nonlinear expansion and contraction in the time domain
because the time sequence of the input speech and the
reference patterns are easily expanded and contracted in the
time domain. The algorithms which take into account
nonlinear expansion and contraction in the time domain include
a DP (Dynamic Programming) matching method, a HMM
(Hidden Markov Model) matching method, and so on. In the
description given below, the DP matching which is widely
used in the art of present speech recognition will be
explained.

If it is assumed that symbols "i", "j" represent time frames
(i=0 to I), (j=0 to J) of a respective input speech and a
reference pattern, and the symbol "c" represents a vector
component, then the time sequence of the feature vectors of
input speech is indicated by X(i, c), and the time sequence
of the reference pattern is indicated by Y(j, c).

The input speech and the reference patterns make up a
two-dimensional space composed of grid points (i, j), and a
minimum path of accumulated distances, among paths from
a starting end (0, 0) to a terminal end (I, J), is regarded as an optimum association between the two patterns, and the accumulated distances are referred to as the distance between the patterns. According to speech recognition based on the DP matching, distances between the input speech and all the reference patterns are calculated, and the acoustic category of one of the reference patterns which gives a minimum distance is outputted as the result of speech recognition.

If the DP matching is carried out for adaptation or learning, then since a reference pattern and the speech to be compared are already limited, the DP matching has its object to determine a mean value of feature vectors in each acoustic category when an optimum time-alignment is obtained between two patterns, rather than speech recognition.

Distances d(i, j) between the grid points (i, j) of the time sequence X(i, c) of the feature vectors of the input speech and the time sequence Y(j, c) of the feature vectors of the reference patterns are defined as follows:

$$ d(i,j) = \min_{k} \sum_{c} \left| X(i,c) - Y^{(k)}(j,c) \right|^{2} $$

where k represents a kth template at respective grid point. A distance for each grid point corresponds to the minimum one of the distances given by the plural templates k.

According to the DP matching, the accumulated distances D(i, j) associated with the grid points (i, j) are indicated by the following recursive equation:

$$ D(i,j) = d(i,j) + \min \begin{cases} D(i-1,\, j) \\ D(i-1,\, j-1) \\ D(i-1,\, j-2) \end{cases} $$

Specifically, accumulated distances D are calculated in a direction for the input speech to increase in time, using the grid point (0, 0) as a starting point and the initial value D(0, 0) as d(0, 0), and when accumulated distances up to the final grid point (I, J) are determined, an optimum matching path between the two patterns is considered to be determined.

The backtracking information that is stored in the backtracking information memory 14 is transition information B(i, j) of the respective grid points which is expressed as follows:

$$ B(i,j) = \operatorname*{argmin}_{j} \begin{cases} D(i-1,\, j) \\ D(i-1,\, j-1) \\ D(i-1,\, j-2) \end{cases} $$

where argmin_j represents the selection of any one of the values j, j-1, j-2 which gives D a minimum value, as the value of a j component.

The template information T(i, j) which is stored in the template information memory 16 is represented by:

$$ T(i,j) = \operatorname*{argmin}_{k} \left\{ \sum_{c} \left| X(i,c) - Y^{(k)}(j,c) \right|^{2} \right\} $$

The backtracking process that has heretofore been carried out by the conventional mean vector calculator 18 will be described below with respect to a simple example where the number of acoustic categories to be classified is 2, i.e., input speech is divided into a noise portion and a speech portion, and their mean values are determined.

If the mean values of noise and speech portions are indicated respectively by N(c), S(c), then the mean values in the respective acoustic categories back along the optimum path from a grid point (I, J) to a grid point (0, 0) are calculated as follows:

In a first step, the values of i, j, N(c), S(c) are set respectively to I, J, 0, 0 as follows:

i=I,
j=J,
N(c)=0, and
S(c)=0.

In a second step, the type of the acoustic category of the grid point (i, j) is checked. If it is a speech category, S(c)=S(c)+X(i, c) is calculated, and if it is a noise category, N(c)=N(c)+X(i, c) is calculated.

In a third step, the values of i and j are checked. If both are 0, the processing jumps to a fifth step, and if i or j is not 0, the processing proceeds to a fourth step.

In the fourth step, i is decremented by 1, and the transition information B(i, j) of the grid point (i, j) is put in j as follows:

i=i-1, and
j=B(i, j).

Thereafter, the processing returns to the second step, and the second and following steps are repeated.

In the fifth step, the contents of N(c), S(c) are divided by the number of times which are respectively summed up, and the mean values in the respective acoustic categories are calculated. The processing is now completed.

In the conventional acoustic category mean value calculating apparatus, the backtracking process is carried out by going from a grid point position composed of a terminal end point of input speech and a terminal end point of a reference pattern back toward a starting end to associate the input speech and the reference pattern in a two-dimensional space. Mean vectors of the input speech are calculated in respective categories of the reference pattern that has been associated by the backtracking process, and outputted as acoustic category mean values.

Since the conventional acoustic category mean value calculating apparatus is required to search in the two-dimensional space in both the matching process that is executed by the matching unit 12 and the backtracking process that is executed by the mean vector calculator 18, the conventional acoustic category mean value calculating apparatus has been disadvantageous in that it needs a large amount of calculations and hence is not suitable for real-time operation. Furthermore, inasmuch as the backtracking process that is executed by the mean vector calculator 18 cannot be started unless the matching process that is executed by the matching unit 12 is finished, the backtracking process and the matching process cannot be executed simultaneously parallel to each other, i.e., they cannot be executed by way of so-called pipeline processing. This also makes the conventional acoustic category mean value calculating apparatus incapable of real-time operation.

Even if the number of acoustic categories to be classified is small, the conventional acoustic category mean value calculating apparatus necessarily needs a large memory as a two-dimensional storage area for carrying out the backtracking process. For this reason, it has been impossible to make the conventional acoustic category mean value calculating apparatus inexpensive.

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide an acoustic category mean value calculating apparatus and an adaptation apparatus which require a reduced memory size for adaptation of input speech or reference patterns, and operate efficiently without need for a backtracking process.
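
For contrast with this object, the conventional two-pass procedure described in the background (DP matching that stores transition information, followed by backtracking) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the apparatus itself: it assumes a single template per reference frame, frames indexed from 0, transitions into (i, j) from (i-1, j), (i-1, j-1) and (i-1, j-2), and a hypothetical `category` list that labels each reference frame. The function names are invented for the example.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dp_match(X, Y):
    """Forward pass: fill accumulated distances D and back-pointers B.

    Both tables have I x J entries, which is the two-dimensional storage
    that the conventional apparatus requires.
    """
    I, J = len(X), len(Y)
    INF = float("inf")
    D = [[INF] * J for _ in range(I)]
    B = [[0] * J for _ in range(I)]
    D[0][0] = sq_dist(X[0], Y[0])
    for i in range(1, I):
        for j in range(J):
            # Best predecessor among j, j-1, j-2 in the previous input frame.
            prev = min((q for q in (j, j - 1, j - 2) if q >= 0),
                       key=lambda q: D[i - 1][q])
            if D[i - 1][prev] < INF:
                D[i][j] = sq_dist(X[i], Y[j]) + D[i - 1][prev]
                B[i][j] = prev
    return D, B


def backtrack_means(X, B, category):
    """Second pass: walk back from the final grid point, averaging X per category."""
    dim = len(X[0])
    sums, counts = {}, {}
    i, j = len(X) - 1, len(B[0]) - 1
    while True:
        cat = category[j]
        acc = sums.setdefault(cat, [0.0] * dim)
        for c in range(dim):
            acc[c] += X[i][c]
        counts[cat] = counts.get(cat, 0) + 1
        if i == 0:
            break
        j = B[i][j]  # read the back-pointer of the current point, then step i
        i -= 1
    return {cat: [s / counts[cat] for s in acc] for cat, acc in sums.items()}
```

The second function cannot start until the first has filled the whole table, which is the pipelining limitation the background describes.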
According to the present invention, there is provided an apparatus for calculating a mean value in each acoustic category by matching, with a matching unit, a time sequence of feature vectors which have been converted from input speech by an analyzer and reference patterns stored in a reference pattern memory and composed of a time sequence of feature vectors that have been converted from training speech sounds and contain weighting information for each acoustic category, to effect a time-alignment in each frame, for thereby calculating a mean value in each of the acoustic categories, the apparatus comprising as many feature vector accumulating buffers as the number of acoustic categories in each frame of the reference patterns, for storing an accumulated value of weighted feature vectors in the acoustic categories, as many weight counters as the number of acoustic categories in each frame of the reference patterns, for storing an accumulated value of weights of feature vectors added in the acoustic categories in each frame, a feature vector integrator which, each time the matching unit effects a matching process in each frame, adds values which have been produced by multiplying the feature vectors of the input speech to be calculated in the frame by the weight in each of the acoustic categories, to values stored in the frame, from which a transition is made and which has been subjected to matching immediately before, in the feature vector accumulating buffers, stores the sums in the frame in the feature vector accumulating buffers, adds the weight in each of the acoustic categories in the frame, from which the transition is made, to values stored in the same acoustic category in the frame in the weight counters, and stores the sum in the weight counters, and a mean value calculator for dividing accumulated values of weighted feature vectors in final frames of the matching in the respective acoustic categories stored in the feature vector accumulating buffers by the accumulated values of weights in the acoustic categories stored in the weight counters, and outputting the quotients as mean values in the acoustic categories of the feature vectors of the input speech to be calculated.

According to the present invention, there is also provided an apparatus for calculating a mean value in each acoustic category by matching, with a matching unit, a time sequence of feature vectors which have been converted from input speech by an analyzer and reference patterns stored in a reference pattern memory and composed of a time sequence of feature vectors that have been converted from training speech sounds and contain weighting information for each acoustic category, to effect a time-alignment in each frame, for thereby calculating a mean value in each of the acoustic categories, the apparatus comprising two acoustic category mean value calculating sections each comprising as many feature vector accumulating buffers as the number of acoustic categories in each frame of the reference patterns, for storing an accumulated value of weighted feature vectors in the acoustic categories, as many weight counters as the number of acoustic categories in each frame of the reference patterns, for storing an accumulated value of weights of feature vectors added in the acoustic categories in each frame, a feature vector integrator which, each time the matching unit effects a matching process in each frame, adds values which have been produced by multiplying the feature vectors of the input speech to be calculated in the frame by the weight in each of the acoustic categories, to values stored in the frame, from which a transition is made and which has been subjected to matching immediately before, in the feature vector accumulating buffers, stores the sums in the frame in the feature vector accumulating buffers, adds the weight in each of the acoustic categories in the frame, from which the transition is made, to values stored in the same acoustic category in the frame in the weight counters, and stores the sum in the weight counters, and a mean value calculator for dividing accumulated values of weighted feature vectors in final frames of the matching in the respective acoustic categories stored in the feature vector accumulating buffers by the accumulated values of weights in the acoustic categories stored in the weight counters, and outputting the quotients as mean values in the acoustic categories of the feature vectors of the input speech to be calculated.

According to the present invention, there is also provided an apparatus for calculating a mean value in each acoustic category by matching, with a matching unit, a time sequence of feature vectors which have been converted from input speech by an analyzer and reference patterns stored in a reference pattern memory and composed of a time sequence of feature vectors that have been converted from training speech sounds and contain added weighting information for each acoustic category, to effect a time-alignment in each frame, for thereby calculating a mean value in each of the acoustic categories, and for adapting at least one of the input speech and the reference patterns using the mean value in each of the acoustic categories, the apparatus comprising two acoustic category mean value calculating sections each comprising as many feature vector accumulating buffers as the number of acoustic categories in each frame of the reference patterns, for storing an accumulated value of weighted feature vectors in the acoustic categories, as many weight counters as the number of acoustic categories in each frame of the reference patterns, for storing an accumulated value of weights of feature vectors added in the acoustic categories in each frame, a feature vector integrator which, each time the matching unit effects a matching process in each frame, adds values which have been produced by multiplying the feature vectors of the input speech to be calculated in the frame by the weight in each of the acoustic categories, to values stored in the frame, from which a transition is made and which has been subjected to matching immediately before, in the feature vector accumulating buffers, stores the sums in the frame in the feature vector accumulating buffers, adds the weight in each of the acoustic categories in the frame, from which the transition is made, to values stored in the same acoustic category in the frame in the weight counters, and stores the sum in the weight counters, and a mean value calculator for dividing accumulated values of weighted feature vectors in final frames of the matching in the respective acoustic categories stored in the feature vector accumulating buffers by the accumulated values of weights in the acoustic categories stored in the weight counters, and outputting the quotients as mean values in the acoustic categories of the feature vectors of the input speech to be calculated.

According to the present invention, there is also provided a method of calculating a mean value in each acoustic category by matching a time sequence of feature vectors which have been converted from input speech and reference patterns composed of a time sequence of feature vectors that have been converted from training speech sounds and containing added weighting information for each acoustic category, to effect a time-alignment in each frame, for thereby calculating a mean value in each of the acoustic categories, the method comprising the steps of, each time the input speech is matched with the reference patterns successively from a first frame of the reference patterns and a transition is made to a next frame, integrating values which have been produced by multiplying the feature vectors of the
input speech to be calculated in the frame in which the input speech is matched with the reference patterns, by the weight in each of the acoustic categories in the frame, and holding the integrated values in each frame, integrating weights in the respective acoustic categories in the frame in which the input speech is matched with the reference patterns, and holding the integrated weights in each frame, after the input speech is matched with the reference patterns in a final frame, dividing a weighted accumulated value of the feature vectors in each of the acoustic categories in the final frame, by a weighted accumulated value of weights in the corresponding acoustic categories in the frame, and outputting a quotient as a mean value in each of the acoustic categories.

In the above method, an accumulation of weighted feature vectors in each of the acoustic categories with respect to the feature vectors extracted from the input speech, and an accumulated value of weights thereof may be calculated to output a mean value in each of the acoustic categories of the feature vectors of the input speech.

In the above method, an accumulation of weighted feature vectors in each of the acoustic categories with respect to the feature vectors of the reference patterns, and an accumulated value of weights thereof may be calculated to output a mean value in each of the acoustic categories of the feature vectors of the reference patterns.

In the above method, an accumulation of weighted feature vectors in each of the acoustic categories with respect to the feature vectors extracted from the input speech and the feature vectors of the reference patterns, and an accumulated value of weights thereof may be calculated to output mean values in each of the acoustic categories of the feature vectors of the input speech and the reference patterns simultaneously with each other.

An adaptation apparatus according to the present invention has an adaptation unit for adapting at least one of input speech and reference patterns using mean values in the respective acoustic categories which are calculated by the apparatus for calculating a mean value in each acoustic category according to the present invention.

Because a mean value in each acoustic category is calculated by the apparatus at the time of completion of the matching process, the calculation process may be carried out in one stage, and hence may require a reduced memory size and operate at a high speed. Since the apparatus is capable of simultaneously effecting the matching process and the mean vector integrating process, the apparatus is able to effect parallel calculations by way of pipeline processing and hence carry out real-time processing.

The above and other objects, features, and advantages of the present invention will become apparent from the following description referring to the accompanying drawings which illustrate an example of preferred embodiments of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a conventional acoustic category mean value calculating apparatus;

FIG. 2 is a block diagram of an acoustic category mean value calculating apparatus in which input speech is converted into a time sequence of feature vectors by utilizing an input speech feature vector integrator which multiplies the feature vectors of the input speech in a frame by a weight in each acoustic category;

FIG. 3 is a block diagram of an acoustic category mean value calculating apparatus in which input speech is converted into a time sequence of feature vectors by utilizing an input speech feature vector integrator which matches reference patterns with input speech, and which integrates and stores weighted feature vectors and weights of the reference patterns;

FIG. 4 is a block diagram of an acoustic category mean value calculating apparatus in which input speech is converted into a time sequence of feature vectors by utilizing two input speech feature vector integrators; and

FIG. 5 is a block diagram of an adaptation apparatus in which input speech is converted into a time sequence of feature vectors by utilizing an input speech feature vector integrator, and which includes an adaptation unit.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

As shown in FIG. 2, an acoustic category mean value calculating apparatus 20 according to a first embodiment of the present invention comprises a matching unit 22 for comparing a time sequence of feature vectors which have been converted from input speech by an analyzer 1 and reference patterns stored in a reference pattern memory 2, composed of a time sequence of feature vectors that have been converted from training speech sounds and contain added weighting information for each acoustic category, to effect a normalization matching, i.e., a time-alignment between the input speech and the reference patterns, as many input speech feature vector accumulating buffers 26 as the number of acoustic categories in each frame of the time sequence of feature vectors of the reference patterns, for storing an accumulation of weighted feature vectors of the input speech, weight counters 27 for storing weights accumulated in each of the acoustic categories of the feature vectors of the input speech which are stored in the input speech feature vector accumulating buffers 26, an input speech feature vector integrator 24 which, each time the matching unit 22 effects a matching process in a frame at each of the times and makes a transition to a next frame, adds values which have been produced by multiplying the feature vectors of the input speech in the frame by the weight in each of the acoustic categories, to values stored in the frame, from which the transition is made, of the same acoustic category in the input speech feature vector accumulating buffers 26, stores the sums in the frame in the input speech feature vector accumulating buffers 26, adds the weight in each of the acoustic categories in the frame, from which the transition is made, to values stored in the same acoustic category in the frame in the weight counters 27, and stores the sum in the weight counters 27, and a mean value calculator 28 which, after the matching process effected by the matching unit 22 and the accumulating process effected by the input speech feature vector integrator 24, divides values of final frame positions in the respective acoustic categories in the input speech feature vector accumulating buffers 26 by the values in the corresponding weight counters 27, and outputs the quotients as mean values in the acoustic categories of the input speech.

The input speech is converted into a time sequence of feature vectors by the analyzer 1 in the same manner as with the conventional analyzer 1. The feature vectors of the input speech which have been converted by the analyzer 1 are associated in the time domain with the reference patterns stored in the reference pattern memory 2 by a known dynamic time warping matching process such as the DP matching or the HMM matching.

It is assumed that the frames of the input speech and reference patterns, i.e., discrete times, are represented by i
(i=1 to I) and j (j=0 to J), respectively, the time sequence of
the feature vectors of the input speech is represented by X(i,
c), and the time sequence of the feature vectors of the
reference patterns is represented by Y^(k)(j, c), where c is a
suffix representing the channel components of the feature
vectors and k is a selected template. There are as many
feature vector accumulating buffers 26 and as many weight
counters 27 as the number of categories p and the number of
reference patterns j, and they are represented respectively by
V_p(j, c), V_p^c(j).

The input speech feature vector integrator 24 effects the
following processing upon each transition that is carried out
for each grid point by the matching unit 22, assuming that
a selected template is represented by k* and a selected
transition by j':

where w_p(j) is the weight of a category p, i.e., a quantity
indicating how much a frame j belongs to the category p, and
is determined in advance with respect to each frame j of a
reference pattern. The weight w_p(j) is of a large value if the
degree to which the frame j belongs to the category p is
large, and of a small value if the degree to which the frame
j belongs to the category p is small. In the simplest case, it
is possible to set the weight w_p(j) to 1 for only the category
to which the frame j belongs, and set it to 0 for the other
categories. In this case, a simple mean value rather than a
weighted mean value is determined.
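
The update equations that the integrator 24 applies at each transition are not legible in this copy. Read against the description of the integrator in the summary, they plausibly take the form below, writing V_p(j, c) for the feature vector accumulating buffers 26 and V_p^c(j) for the weight counters 27. This is a reconstruction under assumptions, not a transcription: in particular, whether the weight also carries the selected template index k* cannot be recovered from the surrounding text.

```latex
\begin{aligned}
V_p(j, c) &= V_p(j', c) + w_p(j)\, X(i, c), \\
V_p^{c}(j) &= V_p^{c}(j') + w_p(j),
\end{aligned}
\qquad \text{for every category } p \text{ and component } c .
```

With w_p(j) restricted to 0 or 1, the first line simply adds X(i, c) to the buffer of the one category to which frame j belongs, and the second line counts the frames assigned to that category.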

At the time the calculations carried out by the matching 
unit 22 are finished, i.e., at the time an optimum path from 
a grid point (0, 0) to a grid point (I, J) is determined, a 
weighted accumulation and an accumulation of integrated 
weights in each of the acoustic categories associated along 
the optimum path are stored in end frame positions (I, J) in 
the feature vector accumulating buffers 26 and the weight 
counters 27. 

The mean value calculator 28 divides the values stored in 
the feature vector accumulating buffers 26 assigned to the 
final frames in the respective acoustic categories of the 
reference patterns, by the values stored in the corresponding 
weight counters 27, thereby to determine mean values V(J, 
c) in the respective acoustic categories of the input speech. 
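
Written out along the optimum path, the quotient formed by the mean value calculator 28 is a weighted mean. The following restatement is only a sketch: it writes V^p(j, c) for the accumulating buffers and V_c^p(j) for the weight counters, and it assumes that each input frame i lies on the optimum path exactly once. The path notation j(i) (the reference frame aligned with input frame i) and the symbol for the resulting mean are introduced here for illustration and do not appear in the original text:

$$
\bar{X}^p(c) \;=\; \frac{V^p(J, c)}{V_c^p(J)} \;=\; \frac{\sum_{i} w^p\bigl(j(i)\bigr)\,X(i, c)}{\sum_{i} w^p\bigl(j(i)\bigr)}
$$

With weights restricted to 0 or 1, the denominator is simply the number of input frames aligned to category p, so the quotient reduces to an ordinary mean.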

It is assumed, as with the conventional process, that 
acoustic categories to be classified comprise two types of 
templates, i.e., speech and noise, and the weight with respect 
to each category has a value of 1 or 0 for the sake of brevity,
e.g., a category of input speech is identified as being of either 
speech or noise. It is also assumed that a noise portion of the 
feature vector accumulating buffers 26 is represented by V(j, 
c), a speech portion thereof by W(j, c), a noise portion of the 
weight counters 27 by Vc(j), and a speech portion thereof by 
Wc(j). 

As in the conventional process, the matching unit 22 starts
to execute a matching process from a starting point at a grid
point (0, 0) of each frame with an accumulated distance D(i,
j)=D(0, 0), and progressively executes the matching process
in the direction in which the frame i of the input speech X(i, c)
increases, thereby calculating the accumulated distance, until
finally a grid point (I, J) is reached.

Depending on the transition of each grid point X, Y in the
matching unit 22, the feature vector integrator 24 operates as
follows:

If the feature vectors Y^k(j, c) of the reference patterns are the
template of speech, then the feature vectors X(i, c) of the
grid point to which a transition is made are added to the
speech portion W(j', c) of the feature vector accumulating
buffers 26, and 1 is added to the speech portion Wc(j') of the
weight counters 27, as follows:

$$
\begin{aligned}
V(j, c) &= V(j', c) \\
W(j, c) &= W(j', c) + X(i, c) \\
V_c(j) &= V_c(j') \\
W_c(j) &= W_c(j') + 1
\end{aligned}
$$

If the feature vectors Y^k(j, c) of the reference patterns are the
template of noise, then the feature vectors X(i, c) of the grid
point to which a transition is made are added to the noise
portion V(j', c) of the feature vector accumulating buffers 26,
and 1 is added to the noise portion Vc(j') of the weight
counters 27, as follows:

$$
\begin{aligned}
V(j, c) &= V(j', c) + X(i, c) \\
W(j, c) &= W(j', c) \\
V_c(j) &= V_c(j') + 1 \\
W_c(j) &= W_c(j')
\end{aligned}
$$

When the matching process effected by the matching unit
22 reaches the grid point (I, J) and hence an optimum path
from the grid point (0, 0) is determined, an accumulated
value of the feature vectors and an accumulated value of
weights in each of the acoustic categories associated along
the optimum path are determined in the feature vector
accumulating buffer 26 and the weight counter 27 that
correspond to the final grid point (I, J).

Therefore, when the matching process effected by the
matching unit 22 is finished, the mean value calculator 28
divides the value of the feature vector accumulating buffer
26 corresponding to the final grid point (I, J) by the value of
the weight counter 27 thereby to determine a mean value in
each acoustic category of the input speech, i.e., a mean value
V(J, c) of the noise portion of the input speech and a mean
value W(J, c) of the speech portion thereof.
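
The accumulation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patented implementation: the function name `category_means`, the squared Euclidean local distance, and the transition set (stay on reference frame j or advance from j-1) are all choices made here for the example. The point it shows is that each grid point inherits the buffers and counters of its best predecessor, so the means are available at the final grid point without backtracking.

```python
def category_means(x, y, w):
    """Per-category means of input frames along the best DP path.

    x: input frames, x[i][c]; y: reference frames, y[j][c];
    w: category weights of the reference frames, w[j][p].
    All names and the transition set are assumptions of this sketch.
    """
    n_dim, n_cat = len(x[0]), len(w[0])
    prev = None  # prev[j] = (accumulated distance, buffers[p][c], counters[p])
    for xi in x:
        cur = []
        for j, yj in enumerate(y):
            d = sum((a - b) ** 2 for a, b in zip(xi, yj))  # local distance
            if prev is None:
                # The path must start at reference frame 0.
                src = (0.0, [[0.0] * n_dim for _ in range(n_cat)],
                       [0.0] * n_cat) if j == 0 else None
            else:
                # Assumed transitions: from j' = j (stay) or j' = j - 1 (advance).
                cands = [prev[jp] for jp in (j, j - 1)
                         if jp >= 0 and prev[jp] is not None]
                src = min(cands, key=lambda s: s[0]) if cands else None
            if src is None:
                cur.append(None)  # grid point not reachable
                continue
            acc, buf, cnt = src
            # Carry the predecessor's buffers forward and add the weighted frame.
            new_buf = [[b + w[j][p] * a for b, a in zip(buf[p], xi)]
                       for p in range(n_cat)]
            new_cnt = [cnt[p] + w[j][p] for p in range(n_cat)]
            cur.append((acc + d, new_buf, new_cnt))
        prev = cur
    if prev is None or prev[-1] is None:
        raise ValueError("final grid point is not reachable")
    _, buf, cnt = prev[-1]
    # Quotient at the final grid point; None for a category that was never visited.
    return [[v / cnt[p] for v in buf[p]] if cnt[p] > 0 else None
            for p in range(n_cat)]


# Two reference frames: frame 0 is noise (category 0), frame 1 is speech (category 1).
means = category_means([[0.0], [0.1], [5.0], [5.2]],
                       [[0.0], [5.0]],
                       [[1, 0], [0, 1]])
```

With 0/1 weights as in the example call, the first two input frames fall in the noise category and the last two in the speech category, so the result is a simple mean per category.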

If the number of acoustic categories to be classified is
small, then the memory size may be smaller than that of the
conventional apparatus. For example, for a general scale in
which the number of acoustic categories to be classified is 2,
the length of reference patterns is 100, the length of input
speech is 200, and the number of dimensions of feature
vectors is 20, the conventional apparatus has required a
memory size of 100x200x2=40,000 for storing backtracking
and template information, whereas the apparatus according
to the present invention requires a memory size of only
100x2x20+100x2=4,200 for storing backtracking and template
information. Therefore, since the memory size of the
acoustic category mean value calculating apparatus according
to the present invention is about 1/10 of that of the
conventional apparatus, the cost of the acoustic category
mean value calculating apparatus according to the present
invention may be lower than that of the conventional
apparatus.
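
The memory comparison above can be checked with a short calculation. The function and parameter names below are hypothetical labels for the quantities named in the text; the formulas are the two products stated there.

```python
def conventional_memory(ref_len, input_len, n_categories):
    # One entry per grid point and category, as in the figure quoted in the text.
    return ref_len * input_len * n_categories


def proposed_memory(ref_len, n_categories, n_dims):
    # Accumulating buffers (one vector per reference frame and category)
    # plus weight counters (one scalar per reference frame and category).
    return ref_len * n_categories * n_dims + ref_len * n_categories


conv = conventional_memory(ref_len=100, input_len=200, n_categories=2)
prop = proposed_memory(ref_len=100, n_categories=2, n_dims=20)
ratio = prop / conv  # roughly one tenth
```

Note that the proposed size does not depend on the input length, which is why the saving grows with longer utterances.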

FIG. 3 shows an acoustic category mean value calculating
apparatus 30 according to a second embodiment of the
present invention. As shown in FIG. 3, the acoustic category
mean value calculating apparatus 30 comprises a matching
unit 22 which is identical to the matching unit 22 of the
acoustic category mean value calculating apparatus 20
according to the first embodiment shown in FIG. 2, a feature
vector integrator 34, feature vector accumulating buffers 36,
weight counters 37, and a mean value calculator 38. While
the feature vector integrator 24 of the acoustic category
mean value calculating apparatus 20 according to the first
embodiment integrates the feature vectors of input speech in
the feature vector accumulating buffers 26 and the weight
counters 27, the feature vector integrator 34 differs therefrom
in that it matches reference patterns stored in the
reference pattern memory 2 with input speech, and integrates
and stores weighted feature vectors and weights of the
reference patterns in the feature vector accumulating buffers
36 and the weight counters 37, and the mean value calculator
38 outputs a mean value of the reference patterns.

Therefore, the arrangements and operation of the parts of
the acoustic category mean value calculating apparatus 30
are substantially the same as those of the acoustic category
mean value calculating apparatus 20 according to the first
embodiment.

It is assumed that the time sequence of feature vectors of
input speech is represented by X(i, c), and the time sequence
of feature vectors of reference patterns is represented by
Y^k(j, c), where i, j represent frames (discrete times),
respectively, of the input speech and the reference patterns, c is a
suffix representing the channel components of the feature
vectors, and k is a selected template. There are as many
feature vector accumulating buffers 36 as the number of
reference patterns j of categories p, and they are represented
by W^p(j, c). Similarly, the weight counters 37 are
represented by W_c^p(j).

The feature vector integrator 34 effects the following
processing upon each transition that is carried out for each
grid point by the matching unit 22, assuming that a selected
template is represented by k, a selected transition by j', and
the weight of a category determined in advance for each
frame j of the reference patterns is represented by w^p(j):

$$
\begin{aligned}
W^p(j, c) &= W^p(j', c) + w^p(j)\,Y^k(j, c) \\
W_c^p(j) &= W_c^p(j') + w^p(j)
\end{aligned}
$$

When the matching process effected by the matching unit
22 reaches the grid point (I, J) and hence an optimum path
from the grid point (0, 0) is determined, an accumulated
value of the feature vectors and an accumulated value of
weights in each of the acoustic categories associated along
the optimum path are determined in the feature vector
accumulating buffer 36 and the weight counter 37 that
correspond to the final grid point (I, J).

Therefore, when the matching process effected by the
matching unit 22 is finished, the mean value calculator 38
divides the value of the feature vector accumulating buffer
36 corresponding to the final grid point (I, J) by the value of
the weight counter 37 thereby to determine a mean value
W^p(J, c) in each acoustic category of the reference patterns.

According to the second embodiment, after the reference
patterns have been nonlinearly processed in the same manner
as with the input speech, a mean value in each category
of the reference patterns can be determined. Therefore, the
accuracy with which the mean value is estimated is
improved, and so is the performance of the acoustic category
mean value calculating apparatus 30.

FIG. 4 shows an acoustic category mean value calculating
apparatus 40 according to a third embodiment of the present
invention. As shown in FIG. 4, the acoustic category mean
value calculating apparatus 40 comprises an acoustic category
mean value calculating section which is identical to
the acoustic category mean value calculating apparatus 20
according to the first embodiment and an acoustic category
mean value calculating section which is identical to the
acoustic category mean value calculating apparatus 30
according to the second embodiment, these acoustic category
mean value calculating sections being coupled to each
other in one apparatus. When the matching process carried
out by the matching unit 22 is completed, the acoustic
category mean value calculating apparatus 40 can calculate
mean values in acoustic categories of both the input speech
and the reference patterns simultaneously with each other.

According to the third embodiment, it is possible to adapt
both the input speech and the reference patterns in order to
determine mean values in acoustic categories of both the
input speech and the reference patterns after the degrees of
nonlinear expansion and contraction of both the input speech
and the reference patterns have been equalized. The acoustic
category mean value calculating apparatus 40 is therefore
higher in performance.

FIG. 5 shows an acoustic category mean value calculating
apparatus 50 according to a fourth embodiment of the
present invention. As shown in FIG. 5, the acoustic category
mean value calculating apparatus 50 comprises an acoustic
category mean value calculating apparatus which is identical
to the acoustic category mean value calculating apparatus 20
according to the first embodiment and an adaptation unit 55
connected to the acoustic category mean value calculating
apparatus. Using mean values in the respective categories of
input speech which have been calculated by the acoustic
category mean value calculating apparatus, the reference
patterns stored in the reference pattern memory 2 are
adapted to generate new reference patterns.

Operation of the acoustic category mean value calculating
apparatus 50 according to the fourth embodiment which is
arranged to effect speaker adaptation in the same manner as
with the literature 1 referred to above will be described
below.

An adaptation vector $\Delta_j$ of acoustic categories is
determined from a mean value $\bar{\mu}_j$ with respect to acoustic
categories j of input speech which is calculated by the
acoustic category mean value calculating apparatus and a
predetermined mean value $\mu_j$ with respect to acoustic
categories j of reference patterns, as follows:

$$\Delta_j = \bar{\mu}_j - \mu_j$$

With respect to acoustic categories i of reference
patterns [...], the adaptation vector $\Delta_i$ is determined
from acoustic categories j of reference patterns for input
speech with acoustic categories, using the same spectral
interpolation as with the above literature, as follows:

$$\Delta_i = \sum_j w_{ij}\,\Delta_j$$

Using these adaptation vectors, the adaptation unit 55
carries out adaptation by establishing:

$$\hat{\mu}_k = \mu_k + \Delta$$

with respect to all the reference patterns k belonging to the
acoustic categories i, j, where $\Delta$ is either $\Delta_i$ or $\Delta_j$ selected
depending on the type of k.

When a mean value I(p, c) with respect to acoustic
categories p has been determined, i.e., a mean value M(p, c)
with respect to acoustic categories p of reference patterns
has been determined in advance, an adaptation vector $\Delta(p, c)$
in each of the acoustic categories is determined by:

$$\Delta(p, c) = I(p, c) - M(p, c).$$

The adaptation unit 55 adds this adaptation vector in each of 
the acoustic categories of the reference patterns to adapt the 
reference patterns for thereby generating new reference 
patterns. 

While the adaptation vector is used as it is to adapt the
reference patterns in the above example, a suitable
coefficient $\alpha$ may be used to establish the following equation:

$$\hat{\mu}_k = \frac{(1+\alpha)\,\mu_k + \Delta}{1+\alpha}$$

to control the degree of adaptation for preventing unduly 
large adaptation. 
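
The adaptation step can be illustrated as follows. This is a minimal sketch: the function names are hypothetical, the vectors are plain Python lists indexed by channel c, and the controlled update is written as the reference mean plus the adaptation vector divided by (1 + alpha). That form is only one reading of the equation above, which is poorly legible in the source, and it should be treated as an assumption.

```python
def adaptation_vectors(input_means, ref_means):
    """Per-category adaptation vector: input mean minus reference mean."""
    return [[i_c - m_c for i_c, m_c in zip(i_p, m_p)]
            for i_p, m_p in zip(input_means, ref_means)]


def adapt_pattern(mu, delta, alpha=0.0):
    """Shift a reference vector by its category's adaptation vector.

    alpha = 0 applies the adaptation vector as it is; a larger alpha
    damps the shift (assumed form: mu + delta / (1 + alpha)).
    """
    return [m + d / (1.0 + alpha) for m, d in zip(mu, delta)]


# One category, two channels.
delta = adaptation_vectors([[1.0, 2.0]], [[0.5, 1.0]])
full = adapt_pattern([0.5, 1.0], delta[0])             # full adaptation
damped = adapt_pattern([0.5, 1.0], delta[0], alpha=1)  # half-strength shift
```

With alpha set to 0, a reference vector equal to the reference mean is moved exactly onto the input mean; increasing alpha moves it only part of the way.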

The acoustic category mean value calculating apparatus 
50 according to the fourth embodiment may be composed of 
a combination of the adaptation unit 55 and either the 
acoustic category mean value calculating apparatus 20 or the 
acoustic category mean value calculating apparatus 30. 

The combination of the adaptation unit 55 and the acous- 
tic category mean value calculating apparatus 50 for extract- 
ing environmental differences, i.e., difference in channel 
distortion and difference in additive noise in the spectral 
domain, between a reference pattern and a short utterance to 
be recognized, and adapting the reference patterns to a new 
environment using the differences, will be described below. 

Experimental results obtained using conventional speech
recognition apparatus have been reported by Takagi, et al.
See Takagi, Hattori, and Watanabe, "Speech Recognition
with Environment Adaptation by Spectrum Equalization,"
Spring Meeting of the Acoustical Society of Japan, 2-P-8,
pp. 173-174, March, 1994.

It is assumed that acoustic categories to be classified are 
speech and noise. A mean spectrum Sw of a speech model 
of reference patterns, a mean spectrum Nw of a noise model 
of reference patterns, a mean spectrum Sv of a speech 
portion of input speech, and a mean spectrum Nv of a noise 
portion of the input speech are obtained by an acoustic 
category mean value calculating apparatus. 

A speech model of reference patterns W(t) is adapted by:

$$\hat{W}(t) = \frac{S_v - N_v}{S_w - N_w}\,\bigl(W(t) - N_w\bigr) + N_v,$$

and a noise model of reference patterns is adapted by:

$$\hat{W}(t) = N_v.$$
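
The two adaptation rules above can be sketched as follows. The function names are hypothetical, the spectra are treated as lists with one value per channel, and the operations are assumed to be applied channel by channel; the sketch also assumes that the speech and noise means of the reference patterns differ in every channel, since otherwise the ratio is undefined.

```python
def adapt_speech_frame(w_t, s_w, n_w, s_v, n_v):
    """Map a reference speech frame W(t) into the input environment.

    s_w, n_w: mean speech / noise spectra of the reference patterns.
    s_v, n_v: mean speech / noise spectra of the input speech.
    Assumes s_w[c] != n_w[c] for every channel c.
    """
    return [(sv - nv) / (sw - nw) * (w - nw) + nv
            for w, sw, nw, sv, nv in zip(w_t, s_w, n_w, s_v, n_v)]


def adapt_noise_frame(n_v):
    """A reference noise frame is replaced by the input noise mean."""
    return list(n_v)


s_w, n_w, s_v, n_v = [5.0], [1.0], [10.0], [2.0]
frame = adapt_speech_frame([4.0], s_w, n_w, s_v, n_v)
# Sanity check of the mapping: the reference speech mean lands on the
# input speech mean, and the reference noise mean on the input noise mean.
speech_mean_mapped = adapt_speech_frame(s_w, s_w, n_w, s_v, n_v)
noise_mean_mapped = adapt_speech_frame(n_w, s_w, n_w, s_v, n_v)
```

The ratio term rescales the speech component above the noise floor (the channel-distortion difference), while the added noise mean accounts for the additive-noise difference.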

The present invention is also applicable to any adaptation
or training apparatus which uses a mean value in each
acoustic category, other than the above apparatus.

It is to be understood that variations and modifications of
the acoustic category mean value calculating apparatus and
the adaptation apparatus disclosed herein will be evident to
those skilled in the art. It is intended that all such
modifications and variations be included within the scope of the
appended claims.

What is claimed is: 

1. An apparatus for calculating a mean value in each 
acoustic category by matching, with a matching unit, a time
sequence of feature vectors which have been converted from 
input speech by an analyzer and reference patterns stored in
a reference pattern memory and composed of a time
sequence of feature vectors that have been converted from 
training speech sounds and contain weighting information 
for each acoustic category, to effect a time-alignment in each 
frame, for thereby calculating a mean value in each of the
acoustic categories, said apparatus comprising: 
as many feature vector accumulating buffers as the num- 
ber of acoustic categories in each frame of the reference 
patterns, for storing an accumulated value of weighted 
feature vectors in the acoustic categories;

as many weight counters as the number of acoustic 
categories in each frame of the reference patterns, for 
storing an accumulated value of weights of feature 
vectors added in the acoustic categories in each frame; 
a feature vector integrator which, each time the matching 
unit effects a matching process in each frame, adds 
values which have been produced by multiplying the 
feature vectors of the input speech to be calculated in 
the frame by the weight in each of the acoustic
categories, to values stored in the frame, from which a
transition is made and which has been subjected to
matching immediately before, in the feature vector
accumulating buffers, stores the sums in the frame in
the feature vector accumulating buffers, adds the
weight in each of the acoustic categories in the frame, 
from which the transition is made, to values stored in 
the same acoustic category in the frame in the weight 
counters, and stores the sum in the weight counters; and 
a mean value calculator for dividing accumulated values
of weighted feature vectors in final frames of the 
matching in the respective acoustic categories stored in 
said feature vector accumulating buffers by the accu- 
mulated values of weights in the acoustic categories 
stored in said weight counters, and outputting the
quotients as mean values in the acoustic categories of 
the feature vectors of the input speech to be calculated. 
2. An apparatus for calculating a mean value in each 
acoustic category by matching, with a matching unit, a time 
sequence of feature vectors which have been converted from
input speech by an analyzer and reference patterns stored in
a reference pattern memory and composed of a time
sequence of feature vectors that have been converted from
training speech sounds and contain weighting information 
for each acoustic category, to effect a time-alignment in each 
frame, for thereby calculating a mean value in each of the 
acoustic categories, said apparatus comprising: 
two acoustic category mean value calculating sections 
each comprising: 

as many feature vector accumulating buffers as the 
number of acoustic categories in each frame of the 
reference patterns, for storing an accumulated value 
of weighted feature vectors in the acoustic catego- 
ries; 

as many weight counters as the number of acoustic 
categories in each frame of the reference patterns, for 
storing an accumulated value of weights of feature 
vectors added in the acoustic categories in each 
frame; 

a feature vector integrator which, each time the match- 
ing unit effects a matching process in each frame, 
adds values which have been produced by multiply- 
ing the feature vectors of the input speech to be 
calculated in the frame by the weight in each of the 
acoustic categories, to values stored in the frame, 
from which a transition is made and which has been 
subjected to matching immediately before, in the
feature vector accumulating buffers, stores the sums
in the frame in the feature vector accumulating 
buffers, adds the weight in each of the acoustic 
categories in the frame, from which the transition is 
made, to values stored in the same acoustic category 
in the frame in the weight counters, and stores the 
sum in the weight counters; and 
a mean value calculator for dividing accumulated val- 
ues of weighted feature vectors in final frames of the 
matching in the respective acoustic categories stored 
in said feature vector accumulating buffers by the 
accumulated values of weights in the acoustic categories
stored in said weight counters, and outputting
the quotients as mean values in the acoustic catego- 
ries of the feature vectors of the input speech to be 
calculated. 

3. An apparatus for calculating a mean value in each 
acoustic category by matching, with a matching unit, a time 
sequence of feature vectors which have been converted from 
input speech by an analyzer and reference patterns stored in 
a reference pattern memory and composed of a time 
sequence of feature vectors that have been converted from 
training speech sounds and contain weighting information 
for each acoustic category, to effect a time-alignment in each 
frame, for thereby calculating a mean value in each of the 
acoustic categories, and for adapting at least one of the input 
speech and the reference patterns using the mean value in 
each of the acoustic categories, said apparatus comprising: 

two acoustic category mean value calculating sections 
each comprising: 

as many feature vector accumulating buffers as the 
number of acoustic categories in each frame of the 
reference patterns, for storing an accumulated value 
of weighted feature vectors in the acoustic catego- 
ries; 

as many weight counters as the number of acoustic 
categories in each frame of the reference patterns, for 
storing an accumulated value of weights of feature 
vectors added in the acoustic categories in each 
frame; 

a feature vector integrator which, each time the match- 
ing unit effects a matching process in each frame, 
adds values which have been produced by multiply- 
ing the feature vectors of the input speech to be 
calculated in the frame by the weight in each of the 
acoustic categories, to values stored in the frame, 
from which a transition is made and which has been 
subjected to matching immediately before, in the 
feature vector accumulating buffers, stores the sums 
in the frame in the feature vector accumulating 
buffers, adds the weight in each of the acoustic 
categories in the frame, from which the transition is 
made, to values stored in the same acoustic category 
in the frame in the weight counters, and stores the 
sum in the weight counters; and 

a mean value calculator for dividing accumulated val- 
ues of weighted feature vectors in final frames of the 
matching in the respective acoustic categories stored
in said feature vector accumulating buffers by the
accumulated values of weights in the acoustic cat- 
egories stored in said weight counters, and output- 
ting the quotients as mean values in the acoustic 
categories of the feature vectors of the input speech
to be calculated.

4. A method of calculating a mean value in each acoustic 
category by matching a time sequence of feature vectors 
which have been converted from input speech and reference
patterns composed of a time sequence of feature vectors that
have been converted from training speech sounds and con- 
taining added weighting information for each acoustic 
category, to effect a time-alignment in each frame, for 
thereby calculating a mean value in each of the acoustic 
categories, said method comprising the steps of:

each time the input speech is matched with the reference 
patterns successively from a first frame of the reference 
patterns and a transition is made to a next frame, 
integrating values which have been produced by multiplying
the feature vectors of the input speech to be
calculated in the frame in which the input speech is 
matched with the reference patterns, by the weight in 
each of the acoustic categories in the frame, and 
holding the integrated values in each frame; 
integrating weights in the respective acoustic categories in
the frame in which the input speech is matched with the 
reference patterns, and holding the integrated weights 
in each frame; 
after the input speech is matched with the reference 
patterns in a final frame, dividing a weighted accumulated
value of the feature vectors in each of the acoustic
categories in the final frame, by a weighted accumu- 
lated value of weights in the corresponding acoustic 
categories in the frame; and 
outputting a quotient as a mean value in each of the
acoustic categories. 

5. A method according to claim 4, wherein an accumulation
of weighted feature vectors in each of the acoustic
categories with respect to the feature vectors extracted from
the input speech, and an accumulated value of weights
thereof are calculated to output a mean value in each of the 
acoustic categories of the feature vectors of the input speech. 

6. A method according to claim 4, wherein an accumulation
of weighted feature vectors in each of the acoustic
categories with respect to the feature vectors of the reference
patterns, and an accumulated value of weights thereof are 
calculated to output a mean value in each of the acoustic 
categories of the feature vectors of the reference patterns. 

7. A method according to claim 4, wherein an accumulation
of weighted feature vectors in each of the acoustic
categories with respect to the feature vectors extracted from
the input speech and the feature vectors of the reference 
patterns, and an accumulated value of weights thereof are 
calculated to output mean values in each of the acoustic 
categories of the feature vectors of the input speech and the
reference patterns simultaneously with each other. 

*****